3 Data Preparation#

import pandas as pd
import matplotlib.pyplot as plt
import os
import numpy as np

# get custom color palette and colormap
from eda_helper import get_custom_palette, get_custom_colormap
# create the file paths for reading in data and for outputting figures and tables
DATA_PATH = "../data/saville_row_east_west/"
OUTPUT_TABLES_PATH = "../output/tables/4/"
OUTPUT_FIGURES_PATH = "../output/figures/4/"

os.makedirs(DATA_PATH, exist_ok=True)
os.makedirs(OUTPUT_TABLES_PATH, exist_ok=True)
os.makedirs(OUTPUT_FIGURES_PATH, exist_ok=True)

custom_palette = get_custom_palette()
custom_colormap = get_custom_colormap()

# read in the files for exploration
east_df = pd.read_pickle(os.path.join(DATA_PATH, "east_df.pkl"))
west_df = pd.read_pickle(os.path.join(DATA_PATH, "west_df.pkl"))

west_complete_days_df = pd.read_pickle(
    os.path.join(DATA_PATH, "west_complete_days_df.pkl")
)
east_complete_days_df = pd.read_pickle(
    os.path.join(DATA_PATH, "east_complete_days_df.pkl")
)

# group on dt to remove trajectories
east_complete_days_df = east_complete_days_df.groupby("dt").agg(
    {"value": "sum", "date": "first"}
)
west_complete_days_df = west_complete_days_df.groupby("dt").agg(
    {"value": "sum", "date": "first"}
)

3.1 Data Selection Rationale#

  • Describe and justify the criteria and methods used to select the data from the initial dataset(s).

  • Discuss the relevance of the selected data to the data mining goals defined earlier.

  • Explain any assumptions made in this process and any implications for the final results.

  • Include the reasons for including or excluding certain attributes.

For the time series, we are going to search for and then select a full week of complete data.

year = 2023

# Get unique week numbers in the data
unique_week_numbers = east_complete_days_df["date"].dt.isocalendar().week.unique()

# Loop through each week number
for week_number in unique_week_numbers:
    # Select data for the specific week and year
    ts_data = east_complete_days_df[
        (east_complete_days_df["date"].dt.isocalendar().week == week_number)
        & (east_complete_days_df["date"].dt.year == year)
    ]

    # A complete week at 15-minute resolution has 4 * 24 * 7 = 672 readings
    is_complete = (4 * 24 * 7) == len(ts_data)

    # Stop at the first complete week
    if is_complete:
        print(f"Week {week_number} is complete.")
        break

print(
    f"Week {week_number} data stored in `ts_data` with length {len(ts_data)}"
)

plt.plot(ts_data["value"])
plt.xticks(rotation=90)
plt.title(f"Week {week_number} of {year}")

timeseries = ts_data[["value"]].values.astype("float32")
Week 6 is complete.
Week 6 data stored in `ts_data` with length 672
[Figure: time series for Week 6 of 2023]

3.2 Data Cleaning Report#

  • Detail the process of cleaning the data, i.e., dealing with missing values, duplicate records, or inconsistent entries.

  • Discuss the methods used for cleaning (e.g., deletion, imputation) and justify the choices made.

  • Highlight any challenges faced during the data cleaning process.

  • Provide before and after summaries to demonstrate the effects of our cleaning efforts.

3.3 Derived Attributes / Records#

  • Discuss any new attributes or records that were derived from existing ones, such as creating new features through binning, combining attributes, or performing calculations.

  • Explain why these transformations were done, and how these new attributes will contribute to the data mining goals.

  • If applicable, describe any generation of synthetic data or aggregation of data at different levels.

We are going to add a number of new features derived from existing data: the periods of the strongest signals, calculated using the Lomb-Scargle periodogram. The purpose is to give the model context for the position of values in the time series. This is especially important because a large number of days are now missing, since all non-complete days have been removed.
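
As a rough sketch of how such periods can be found (using scipy.signal.lombscargle on synthetic stand-in data; the frequency grid and helper code used in the project are assumptions):

```python
import numpy as np
from scipy.signal import lombscargle

# Synthetic, irregularly sampled stand-in for the pedestrian counts:
# a daily cycle (period 96 at 15-minute resolution) plus noise
rng = np.random.default_rng(0)
t = np.sort(rng.choice(np.arange(672.0), size=400, replace=False))
y = np.sin(2 * np.pi * t / 96.0) + 0.1 * rng.normal(size=t.size)

# Evaluate the periodogram over a grid of candidate periods (in samples)
periods = np.linspace(8.0, 200.0, 2000)
angular_freqs = 2 * np.pi / periods
power = lombscargle(t, y - y.mean(), angular_freqs)

# The strongest signal's period can then seed derived features,
# e.g. sin/cos phase columns at that period
strongest_period = periods[np.argmax(power)]
print(f"Strongest period: {strongest_period:.1f} samples")
```

The Lomb-Scargle periodogram is well suited here precisely because it tolerates the gaps left by removing non-complete days.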

3.4 Merged Data#

  • Describe any data that was merged or joined together from different sources or tables.

  • Discuss the methods used to combine the data and any difficulties faced (like inconsistencies or discrepancies between the datasets).

  • Explain how any issues were resolved (e.g., how discrepancies were reconciled).

  • Detail how the merged data was validated to ensure correctness.

We are going to add the term dates to the data for use in the model, as a boolean flag indicating whether each timestamp falls within term time.
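
A minimal sketch of how such a flag might be added, assuming hypothetical term date ranges (the real dates would come from the relevant academic calendar):

```python
import pandas as pd

# Hypothetical term date ranges -- the real ones would come from the
# relevant academic calendar
terms = [("2023-01-09", "2023-03-24"), ("2023-04-17", "2023-06-30")]

df = pd.DataFrame({"date": pd.date_range("2023-02-01", periods=4, freq="D")})

# Flag each row as in-term if its date falls inside any term interval
df["in_term"] = False
for start, end in terms:
    df["in_term"] |= df["date"].between(start, end)

print(df)
```

Validating the merge can be as simple as checking the proportion of in-term rows against the calendar's expected coverage.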

3.5 Reformatted Data#

  • Describe the process of reformatting the data for modeling or visualization purposes.

  • Discuss any transformations of the data into different types (e.g., numerical to categorical, date-time conversions), creation of dummy variables, or re-scaling of features.

  • Explain how and why any re-sampling procedures were done, if applicable.

  • Discuss the format that the final dataset(s) will be in for the subsequent modeling or visualization stage.

First we need to scale the data to improve gradient descent (GD) convergence and to avoid vanishing/exploding gradients, both of which slow down model training. We will use sklearn.preprocessing.MinMaxScaler, as our data is discrete and follows a heavily right-skewed mixture distribution with a long tail and an exponential component; it is important that we preserve the shape of the original distribution, which min-max scaling does.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_timeseries = scaler.fit_transform(timeseries)

Model 0: Baseline#

This model takes the current value as input and returns it for the next time step, effectively predicting “no change”. As the model can only predict a single time step into the future, the prediction will not change as window length increases. We can, however, add an option to increase the number of time steps the model predicts and see how performance changes as the horizon grows. For single time steps the baseline should be a reasonable estimate, as the number of pedestrians generally changes slowly throughout the day. First we need to convert our pandas.core.series.Series object into a numpy.ndarray object. We are dealing with a one-dimensional array whose values form a time vector, i.e. an ordered list.

We then want to split the data into windows. In our first test run we use a week of continuous data and split it into windows with WINDOW_SIZE=4, HORIZON=4, and STRIDE=4. The baseline model will take the input and return its last value as the output for every step of the horizon. We will then calculate some scores to see how far the predictions are from the labels.
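
The sliding_windows helper imported from eda_helper is not shown; a minimal re-implementation consistent with the shapes printed in this section (an assumption, not the project's actual code) might look like this:

```python
import numpy as np

def sliding_windows(series, window_size, horizon, stride=1):
    """Split an (N, 1) array into (input, target) window pairs.

    Hypothetical re-implementation; the project's version evidently
    returns torch tensors (e.g. via torch.from_numpy).
    """
    inputs, targets = [], []
    for start in range(0, len(series) - window_size - horizon + 1, stride):
        inputs.append(series[start : start + window_size])
        targets.append(series[start + window_size : start + window_size + horizon])
    print(f"Input shape: {inputs[0].shape} | Target shape: {targets[0].shape}")
    return np.array(inputs), np.array(targets)

# Tiny worked example: 12 points, window 4, horizon 4, stride 4
series = np.arange(12, dtype="float32").reshape(-1, 1)
X, y = sliding_windows(series, window_size=4, horizon=4, stride=4)
print(X.shape, y.shape)
```

With 672 samples, a window of 4, horizon of 4, and stride of 4, this construction yields 167 window pairs, which matches the tensor shapes printed below.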

from IPython import display
import os

display.Image(
    os.path.join(os.getcwd(), "graphics", "sliding_window_w3h3s1.png"),
    width=768,
    height=768,
)
[Figure: sliding window diagram, window=3, horizon=3, stride=1]
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import matplotlib.animation as animation
import IPython.display as ipy
from IPython.display import HTML
from eda_helper import sliding_windows, plot_windows


WINDOW_SIZE = 4
HORIZON = 4
STRIDE = 4
inputs, targets = sliding_windows(scaled_timeseries, WINDOW_SIZE, HORIZON, STRIDE)
print(
    f"Inputs array shape: {inputs.squeeze().shape} | Target array shape: {targets.squeeze().shape}"
)
print(f"Inputs type: {type(inputs)} | Targets type: {type(targets)}")


class Model0(nn.Module):
    def __init__(self, horizon):
        super().__init__()
        self.horizon = horizon

    def forward(self, x):
        # Get the last element from each input sequence
        last_values = x[:, -1:]
        # Repeat this value to match the horizon length
        return last_values.expand(-1, self.horizon)


baseline = Model0(HORIZON)

predictions = baseline(inputs.squeeze())

plot_windows(inputs, predictions, targets, num_plots=2, step=1)
Input shape: (4, 1) | Target shape: (4, 1)
Inputs array shape: torch.Size([167, 4]) | Target array shape: torch.Size([167, 4])
Inputs type: <class 'torch.Tensor'> | Targets type: <class 'torch.Tensor'>
[Figure: baseline predictions and targets for sample windows]
plt.ioff()  # Turn off interactive plotting

fig, ax = plt.subplots(1, 1, figsize=(6, 4))


def animate(i):
    ax.clear()
    ax.plot(
        range(len(inputs[i])),
        inputs[i],
        label="Inputs",
        color=custom_palette[0],
    )
    ax.scatter(
        range(len(inputs[i]), len(inputs[i]) + len(predictions[i])),
        predictions[i],
        label="Predictions",
        color=custom_palette[1],
        marker="x",
        s=60,
    )
    ax.scatter(
        range(len(inputs[i]), len(inputs[i]) + len(targets[i])),
        targets[i],
        label="Targets",
        color=custom_palette[0],
    )
    ax.legend()
    ax.set_title(f"Window {i+1}")
    ax.set_ylim(0, inputs.max())


# Assign the FuncAnimation to a variable
ani = animation.FuncAnimation(fig, animate, frames=len(inputs[:96]), repeat=True)

# Create a display object
display_obj = HTML(ani.to_jshtml())

# Display the object
ipy.display(display_obj)

Model 0 Performance#

targets.squeeze().flatten().shape, predictions.flatten().shape
(torch.Size([668]), torch.Size([668]))
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Flattening predictions and targets for metric calculations
predictions_flat_np = predictions.flatten().detach().numpy()
targets_flat_np = targets.flatten().detach().numpy()

# Mean Absolute Error (MAE)
mae = mean_absolute_error(targets_flat_np, predictions_flat_np)
print(f"Mean Absolute Error (MAE): {mae}")

# Mean Squared Error (MSE)
mse = mean_squared_error(targets_flat_np, predictions_flat_np)
print(f"Mean Squared Error (MSE): {mse}")

# Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")

# Symmetric Mean Absolute Percentage Error (SMAPE)
epsilon = 1e-10  # stops divide by zero error
smape = (
    np.mean(
        2.0
        * np.abs(targets_flat_np - predictions_flat_np)
        / (np.abs(targets_flat_np) + np.abs(predictions_flat_np) + epsilon)
    )
    * 100
)
print(f"Symmetric Mean Absolute Percentage Error (SMAPE): {smape}%")

# R-squared (R²)
r2 = r2_score(targets_flat_np, predictions_flat_np)
print(f"R-squared (R²): {r2}")

# Create a dictionary with the metrics
metrics = {
    "Model": "Baseline",
    "MAE": mae,
    "MSE": mse,
    "RMSE": rmse,
    "SMAPE": smape,
    "R^2": r2,
}

performance_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "SMAPE", "R^2"])

# Append the dictionary to the DataFrame
metrics_df = pd.DataFrame(metrics, index=[0])
performance_df = pd.concat([performance_df, metrics_df])

# Print the DataFrame
print(performance_df.round(3))
Mean Absolute Error (MAE): 0.05719786882400513
Mean Squared Error (MSE): 0.0077348146587610245
Root Mean Squared Error (RMSE): 0.0879477933049202
Symmetric Mean Absolute Percentage Error (SMAPE): 42.62365400791168%
R-squared (R²): 0.8907532938409718
      Model    MAE    MSE   RMSE   SMAPE    R^2
0  Baseline  0.057  0.008  0.088  42.624  0.891

Model 1: Simple LSTM#

For our first LSTM model, we will use a single LSTM layer and a linear output layer. The model can only provide a single time-step output, so how far into the future we predict depends on the frequency of the data points that we feed into it from our generated windows. To begin with, we attempt to predict 15 minutes into the future, as this is the highest frequency of measurements we have for training. Later on we may aggregate to hourly or daily measurements to allow us to predict further into the future.

train_size = int(len(scaled_timeseries) * 0.67)
test_size = len(scaled_timeseries) - train_size
train, test = scaled_timeseries[:train_size], scaled_timeseries[train_size:]

WINDOW_SIZE = 40
HORIZON = 1

X_train, y_train = sliding_windows(train, WINDOW_SIZE, HORIZON)
X_test, y_test = sliding_windows(test, WINDOW_SIZE, HORIZON)
print()
print(f"Train input tensor: {X_train.shape} | Train target tensor: {y_train.shape}")
print()
print(f"Test input tensor: {X_test.shape} | Test target tensor: {y_test.shape}")
Input shape: (40, 1) | Target shape: (1, 1)
Input shape: (40, 1) | Target shape: (1, 1)

Train input tensor: torch.Size([410, 40, 1]) | Train target tensor: torch.Size([410, 1, 1])

Test input tensor: torch.Size([182, 40, 1]) | Test target tensor: torch.Size([182, 1, 1])
class LSTMModel1(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=1, hidden_size=50, num_layers=1, batch_first=True
        )
        self.linear = nn.Linear(50, 1)

    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.linear(x)
        return x[:, -1, :]  # Selecting the last output of the sequence
from matplotlib import animation
from IPython.display import HTML
import matplotlib.pyplot as plt
import matplotlib


# A function to train the model and record the training data
def train_model_and_record_data(
    X_train,
    y_train,
    X_test,
    y_test,
    window_size,
    epochs,
    device,
):
    LSTM_v1 = LSTMModel1().to(device)
    optimiser = optim.Adam(LSTM_v1.parameters())
    loss_fn = nn.MSELoss()
    loader = data.DataLoader(
        data.TensorDataset(X_train, y_train.squeeze(-1)), shuffle=True, batch_size=8
    )

    training_data_per_window = []

    for epoch in range(epochs):
        LSTM_v1.train()
        for X_batch, y_batch in loader:
            y_pred = LSTM_v1(X_batch)
            loss = loss_fn(y_pred, y_batch)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()

        LSTM_v1.eval()
        with torch.inference_mode():
            y_pred = LSTM_v1(X_train)
            train_rmse = np.sqrt(loss_fn(y_pred, y_train.squeeze(-1))).item()
            test_pred = LSTM_v1(X_test)
            test_rmse = np.sqrt(loss_fn(test_pred, y_test.squeeze(-1))).item()

            train_plot = np.ones_like(timeseries) * np.nan
            y_pred = LSTM_v1(X_train)
            y_pred = y_pred[:, -1]
            train_plot[window_size:train_size] = np.expand_dims(
                y_pred.detach().numpy(), axis=1
            )

            test_plot = np.ones_like(timeseries) * np.nan
            test_pred = LSTM_v1(X_test)[:, -1]
            test_plot[train_size + window_size : len(timeseries)] = np.expand_dims(
                test_pred.detach().numpy(), axis=1
            )

            # Add the data for this epoch to the list
            training_data_per_window.append(
                (epoch, train_plot, test_plot, train_rmse, test_rmse)
            )

            if (epoch + 1) % 10 != 0:
                continue

            print(
                f"Window Size: {window_size} | Epoch: {epoch+1} | Train RMSE: {train_rmse:.4f} | Test RMSE: {test_rmse:.4f}"
            )

    return training_data_per_window
# Main code
training_data = []

device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)

EPOCHS = 20
WINDOW_SIZES = [2, 3, 4, 6, 8, 10, 15, 20]

for window_size in WINDOW_SIZES:
    X_train, y_train = sliding_windows(train, window_size, HORIZON)
    X_test, y_test = sliding_windows(test, window_size, HORIZON)

    training_data_per_window = train_model_and_record_data(
        X_train, y_train, X_test, y_test, window_size, EPOCHS, device
    )

    # Add the data for this window to the main training_data list
    training_data.extend(
        [(window_size,) + record for record in training_data_per_window]
    )

# Set the animation embed limit
matplotlib.rcParams["animation.embed_limit"] = 2**100

# Create the initial plot
fig, ax = plt.subplots()

x_coords = np.arange(len(scaled_timeseries))

(line1,) = ax.plot(x_coords, scaled_timeseries, c=custom_palette[0])
(line2,) = ax.plot(
    x_coords, np.full_like(scaled_timeseries, np.nan), c=custom_palette[1]
)
(line3,) = ax.plot(
    x_coords, np.full_like(scaled_timeseries, np.nan), c=custom_palette[2]
)


# Animation update function
def animate(i):
    line2.set_ydata(training_data[i][2])
    line3.set_ydata(training_data[i][3])
    ax.set_title(
        f"Window Size: {training_data[i][0]} | Epoch: {training_data[i][1]+1} | Train RMSE: {training_data[i][4]:.4f} | Test RMSE: {training_data[i][5]:.4f}"
    )
    return (
        line1,
        line2,
        line3,
    )


# Create the animation
ani = animation.FuncAnimation(
    fig, animate, frames=len(training_data), interval=100, blit=True, repeat=True
)

# Display the animation
HTML(ani.to_jshtml())
Input shape: (2, 1) | Target shape: (1, 1)
Input shape: (2, 1) | Target shape: (1, 1)
Input shape: (3, 1) | Target shape: (1, 1)
Input shape: (3, 1) | Target shape: (1, 1)
Input shape: (4, 1) | Target shape: (1, 1)
Input shape: (4, 1) | Target shape: (1, 1)
Input shape: (6, 1) | Target shape: (1, 1)
Input shape: (6, 1) | Target shape: (1, 1)
Input shape: (8, 1) | Target shape: (1, 1)
Input shape: (8, 1) | Target shape: (1, 1)
Input shape: (10, 1) | Target shape: (1, 1)
Input shape: (10, 1) | Target shape: (1, 1)
Input shape: (15, 1) | Target shape: (1, 1)
Input shape: (15, 1) | Target shape: (1, 1)
Input shape: (20, 1) | Target shape: (1, 1)
Input shape: (20, 1) | Target shape: (1, 1)
y_pred = LSTM_v1(X_test)
X_test.shape, y_pred.unsqueeze(2).shape, y_test.shape
(torch.Size([209, 10, 1]), torch.Size([209, 1, 1]), torch.Size([209, 4, 1]))
plot_windows(X_test, y_pred.unsqueeze(2).detach().numpy(), y_test, num_plots=4, step=1)
[Figure: LSTM predictions and targets for sample test windows]

Windowing, also known as “rolling windows” or “sliding windows”, is a common method in time series analysis and is often used with machine learning models. Whether it is essential depends largely on the specific task, the data characteristics, and the model being used.

Here are some points to consider:

Sequence Learning: Many machine learning models, especially those used for time series such as Recurrent Neural Networks (RNNs) and their variants (like LSTM and GRU), require input in the form of sequences. They make predictions based on the “memory” of past data points. Windowing is a way of creating these sequences from a time series.

Stationarity: Windowing can help make a time series more stationary (i.e., its statistical properties do not depend on the time at which the series is observed), for example by differencing the values within each window; stationarity is a requirement for certain models.
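
For example, first-order differencing removes a linear trend (a generic illustration, not tied to the project's data):

```python
import numpy as np

# A deterministic upward trend is non-stationary; its first difference
# is constant, hence stationary
trend = np.arange(10, dtype="float64")
diffed = np.diff(trend)
print(diffed)
```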

Feature Engineering: Windowing can also be used for feature engineering, where you create new features based on a window of values - such as rolling averages, standard deviations, etc. This can help capture trends and seasonality in the data which can be crucial for the performance of the models.

Reduce Noise: A windowed average can sometimes help reduce noise and smooth out short-term fluctuations, which may be beneficial for the model.

Temporal Dependencies: In time-series problems, current outputs can be dependent on previous inputs or outputs (auto-regressive nature). Windowing allows capturing these temporal dependencies.

However, it’s worth noting that windowing is just one of many techniques in time series analysis. Depending on the nature of your problem, other techniques may also be applicable and useful. It is always recommended to understand your data and problem thoroughly and choose the techniques that are most appropriate for your specific use case.
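
The feature-engineering and noise-reduction points above can be sketched with pandas rolling windows (a generic example, not tied to the pedestrian counts):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype="float64"))

# Derive rolling-window features; the first window_size - 1 rows are NaN
features = pd.DataFrame({
    "value": s,
    "rolling_mean_4": s.rolling(window=4).mean(),
    "rolling_std_4": s.rolling(window=4).std(),
})
print(features.tail())
```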

Aside from windowing, there are various techniques used in time series analysis, including:

  • Autoregression (AR): Models the next step in the sequence as a linear function of the observations at prior time steps.

  • Moving Average (MA): Models the next step in the sequence as a linear function of the residual errors from a mean process at prior time steps.

  • Autoregressive Moving Average (ARMA): Models the next step in the sequence as a linear function of the observations and residual errors at prior time steps.

  • Autoregressive Integrated Moving Average (ARIMA): An extension of ARMA that can model a wide range of time series data.

  • Seasonal Autoregressive Integrated Moving-Average (SARIMA): An extension of ARIMA that explicitly supports univariate time series data with a seasonal component.

  • Vector Autoregression (VAR): Models the next step in each time series using an AR model.

  • Vector Autoregression Moving-Average (VARMA): Models the next step in each time series using an ARMA model.

  • Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX): An extension of the VARMA model that also includes the modeling of exogenous variables.

  • Simple Exponential Smoothing (SES): Models the next time step as an exponentially weighted linear function of observations at prior time steps.

  • Holt-Winters Exponential Smoothing (HWES): Models the next step in the sequence as an exponentially weighted linear function of observations at prior time steps, taking trends and seasonality into account.
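
As a concrete illustration of the first of these, an AR(1) coefficient can be estimated by ordinary least squares with nothing but NumPy (synthetic data; libraries such as statsmodels provide full implementations of all the models listed):

```python
import numpy as np

# Simulate an AR(1) process: x_t = phi * x_{t-1} + noise
rng = np.random.default_rng(42)
phi_true = 0.8
x = np.zeros(500)
for t in range(1, 500):
    x[t] = phi_true * x[t - 1] + rng.normal(scale=0.5)

# Ordinary least squares on the lagged pairs (x_{t-1}, x_t)
lagged, current = x[:-1], x[1:]
phi_hat = (lagged @ current) / (lagged @ lagged)
print(f"Estimated phi: {phi_hat:.3f}")
```

The estimate should land close to the true coefficient of 0.8, showing how autoregression captures the temporal dependencies discussed above.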